import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from time import time
from pandas_profiling import ProfileReport
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_validate
from imblearn.over_sampling import SMOTE, ADASYN
from xgboost import XGBClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import precision_recall_curve, roc_auc_score
from sklearn.metrics import fbeta_score,accuracy_score
# from IPython.core.interactiveshell import InteractiveShell
# InteractiveShell.ast_node_interactivity = "all"
import warnings
warnings.filterwarnings('ignore')
Input variables:
1 - age (numeric)
2 - job : type of job (categorical: "admin.","blue-collar","entrepreneur","housemaid","management","retired","self-employed","services","student","technician","unemployed","unknown")
3 - marital : marital status (categorical: "divorced","married","single","unknown"; note: "divorced" means divorced or widowed)
4 - education (categorical: "basic.4y","basic.6y","basic.9y","high.school","illiterate","professional.course","university.degree","unknown")
5 - default: has credit in default? (categorical: "no","yes","unknown")
6 - housing: has housing loan? (categorical: "no","yes","unknown")
7 - loan: has personal loan? (categorical: "no","yes","unknown")
8 - contact: contact communication type (categorical: "cellular","telephone")
9 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
10 - day_of_week: last contact day of the week (categorical: "mon","tue","wed","thu","fri")
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y="no"). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: "failure","nonexistent","success")
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)
Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: "yes","no")
We want to use historical telemarketing data to simulate the customer's final purchase decision. The finalised model can then predict the outcome of future campaigns, allowing the salesperson to optimise the calling sequence and thus improve efficiency and the total purchase rate. The primary metrics we want to optimize are accuracy and precision.
This data is obtained from the UCI repository. It was provided by a Portuguese banking institution and contains the results of direct marketing campaigns for term deposits, conducted by phone over 10 months. According to Investopedia, a term deposit is a financial product that guarantees a fixed interest rate on your deposit.
df_bank_raw= pd.read_csv("bank-additional-full.csv",sep = ";")
df_bank_raw.drop(columns = ['duration'], inplace = True)
df_bank_raw.rename({'y': 'outcome'}, axis=1, inplace=True)
df_bank = df_bank_raw.copy()
df_bank_raw.head(3)
df_bank_raw.info()
df_bank_raw.describe()
def check_categorical_values(data):
    # Print the unique labels of every categorical column
    for c in data.columns:
        if data[c].dtype == 'object':
            unique_vals = data[c].unique()
            print(c, "-"*5, unique_vals)

check_categorical_values(df_bank)
categorical_columns = []
numerical_columns = []
for c in df_bank:
    if df_bank[c].dtype == 'object':
        categorical_columns.append(c)
    else:
        numerical_columns.append(c)
print("categorical_columns - \n", categorical_columns, "\n")
print("numerical_columns - \n", numerical_columns)
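The same split can be written more compactly with `DataFrame.select_dtypes`; a minimal sketch on a toy frame (the columns here are illustrative, not the full dataset):

```python
import pandas as pd

df = pd.DataFrame({"age": [30, 41], "job": ["admin.", "services"], "euribor3m": [4.8, 1.3]})
# Non-numeric columns are treated as categorical, numeric columns as numerical
categorical_columns = df.select_dtypes(exclude="number").columns.tolist()
numerical_columns = df.select_dtypes(include="number").columns.tolist()
print(categorical_columns)  # ['job']
print(numerical_columns)    # ['age', 'euribor3m']
```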
The dataset is labelled.
We should use a supervised learning model for this project.
The target variable, 'outcome', is binary categorical.
The final model should therefore be a classifier.
This dataset contains 11 categorical columns, and each has many labels.
Sklearn's algorithm cheat sheet suggests Linear SVC and the KNN classifier for this data. However, although SVM-based models such as Linear SVC may achieve good results, running and tuning such a model would be too time-consuming.
The one-hot encoded, ML-ready data will have about 60 columns. Hence, we will avoid SVM models in the model-testing stage.
The value range for each column differs greatly, and some attributes have outliers.
Due to computation limitations, we will not use any distance-based method except the KNN classifier, which is included purely for comparison. We expect to rely on tree-based and boosting algorithms for this project, which are affected by neither outliers nor feature scale. Hence, we will skip feature scaling and outlier removal in the later data engineering.
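The claim that tree-based models are insensitive to feature scale can be checked on synthetic data; this sketch (not the bank dataset) fits the same decision tree before and after standard scaling:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(42)
X = rng.uniform(0, 1000, size=(200, 3))          # deliberately large, unscaled values
y = (X[:, 0] + X[:, 1] > 1000).astype(int)

X_scaled = StandardScaler().fit_transform(X)

tree_raw = DecisionTreeClassifier(random_state=42).fit(X, y)
tree_scaled = DecisionTreeClassifier(random_state=42).fit(X_scaled, y)

# Tree splits are threshold-based, so a monotone per-feature rescaling
# leaves the learned partition (and hence the predictions) unchanged
print((tree_raw.predict(X) == tree_scaled.predict(X_scaled)).mean())  # 1.0
```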
#use pandas profiling to gain some basic observations
report = ProfileReport(df_bank_raw)
report.to_file(output_file='data_report.html')
#Please see data_report.html for the full file if the output is not complete
report
df_eda = df_bank_raw.copy()
# Change successful to 1, unsuccessful to 0
df_eda['outcome'] = (df_eda['outcome'] == 'yes').astype(int)
# Compute the number and percentage of successful outcomes for each level of a variable
def percent_table(data, column, target='outcome'):
    # a list (not a set) keeps the aggregation column order deterministic
    new_table = data.groupby(column)[target].agg(['sum', 'count'])
    new_table['percentage'] = new_table['sum'] / new_table['count']
    return new_table
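An equivalent formulation uses pandas named aggregation, which makes the column intent explicit; a sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({"job": ["admin.", "admin.", "retired", "retired"],
                    "outcome": [1, 0, 1, 1]})
# Named aggregation: output columns are labelled at the call site
table = toy.groupby("job").agg(successes=("outcome", "sum"),
                               contacts=("outcome", "count"))
table["percentage"] = table["successes"] / table["contacts"]
print(table)
```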
fig, ax = plt.subplots(1,2, figsize = (12,5))
# Plot the Euribor 3m Distribution of our data and split it to two parts
sns.histplot(df_eda['euribor3m'],ax=ax[0], alpha = 0.6)
ax[0].set_title('Euribor 3m Distribution', fontsize =14, pad=12)
ax[0].set_xlabel('euribor3m(%)')
ax[0].axvline(3, linestyle='--')
# Plot the percentage of Success by whether Euribor 3m greater 3 or not
df_eda['euribor3mgreater3'] = (df_eda['euribor3m'] > 3).astype(int)
sns.barplot(data=percent_table(df_eda, 'euribor3mgreater3').reset_index(),
x='euribor3mgreater3',y='percentage', ax=ax[1], alpha = 0.75)
ax[1].set_title('Success Rate for Euribor 3m Greater than 3 or Not', fontsize =14, pad=12)
ax[1].set_xlabel('Euribor 3m Greater than 3 or Not')
plt.subplots_adjust(wspace = 0.4) # Adjust the space between plots
plt.show()
# provide numerical details for above plots
percent_table(df_eda, 'euribor3mgreater3').style.background_gradient()
From the pandas profiling report, we notice that age, job and marital status are more correlated with the marketing outcome than the other attributes.
First, let's display the average success rate of each category group for these attributes using coloured dataframes.
# provide numerical details for plots
percent_table(df_eda, 'job').style.background_gradient()
# provide numerical details for plots
percent_table(df_eda, 'marital').style.background_gradient()
We will not discuss the 'unknown' group because its sample size is too small.
# Partition the age to 7 different age ranges
df_eda['age_bins'] = pd.cut(x=df_eda['age'], bins=[0,20,25,30,45,55,60,120])
percent_table(df_eda, 'age_bins').style.background_gradient()
After discretizing the 'age' attribute into 7 age groups, we found that people under 20 and over 60 have a much higher purchase rate of more than 40%.
Although the (0,20] group is much smaller than the others, we can still see that the purchase rate is generally higher in younger groups: the younger the group, the higher the purchase rate.
# Create a new variable that if this person been married before
df_eda['been_married'] = ((df_eda['marital'] == 'divorced') | (df_eda['marital'] == 'married')).astype(int)
fig,ax = plt.subplots(3,1, figsize = (12,25))
# Plot the age distribution of different jobs
sns.boxplot(data= df_eda[['age','job']],x='job',y='age',ax=ax[0])
ax[0].tick_params(labelrotation=45)
ax[0].set_title('Age Distribution of different jobs', fontsize=15, pad=12)
# Plot the age distribution of different marital status (been married or not)
sns.histplot(data=df_eda,x='age',hue = 'been_married',ax=ax[1])
ax[1].set_title('Age Distribution (been married or not)', fontsize =15, pad=12)
# Plot the successful rate for different ages
sns.barplot(x='age_bins',y='percentage',
data = percent_table(df_eda, 'age_bins').reset_index(),
palette = 'Set2',ax=ax[2])
ax[2].set_title('Success Rate at Different Age Ranges', fontsize = 15, pad =12)
plt.subplots_adjust(hspace = 0.35) # Adjust the space between plots
plt.show()
From the third plot, we observed that the previous findings are consistent with this graph:
The real reason the 'retired' and 'student' groups have a higher purchase rate is that their average ages are about 60 and 20, which fall into the (60,120] and (0,20] age groups respectively.
The 'never_married' and 'been_married' groups have average ages of about 30 and 35, respectively. Thus, the purchase rate of the 'never_married' ('single') group (14%) is similar to that of the (25,30] group (13.5%), and the purchase rate of the 'been_married' group (10%) is similar to that of the (30,45] group (9%).
campaign_df = percent_table(df_eda, 'campaign')
fig,ax = plt.subplots(1,2, figsize = (14,6))
# group the data of campaign number >= 16 to a single row
campaign_df.loc[16] = campaign_df.loc[16:].sum()
campaign_df = campaign_df.loc[:16]
campaign_df['percentage'] = campaign_df['sum']/campaign_df['count']
campaign_df.rename(index={16:'>= 16'},inplace=True)
# Plot the success rate for different numbers of contacts during this campaign
sns.barplot(data=campaign_df.reset_index(), x='campaign', y='percentage',
            palette='Blues', ax=ax[0])
ax[0].set_title('Success Rate vs \n number of contacts during this campaign', fontsize =14)
# Plot the success rate for different numbers of contacts previous to this campaign
previous_df = percent_table(df_eda, 'previous')
sns.barplot(data=previous_df.reset_index(), x='previous', y='percentage',
            palette='Greens', ax=ax[1])
ax[1].set_title('Success Rate vs \n number of contacts previous to this campaign', fontsize =14)
plt.show()
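The row-grouping above (folding all campaign counts of 16 or more into one bucket) can also be done up front with `Series.clip`; a toy sketch:

```python
import pandas as pd

toy = pd.Series([1, 2, 16, 25, 40])
# Values above the cap are replaced by the cap itself, so everything >= 16
# collapses into a single group before any groupby
capped = toy.clip(upper=16)
print(capped.tolist())  # [1, 2, 16, 16, 16]
```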
The above findings could be used to make four suggestions to the bank marketers:
Below, we will use 'df_bank' to demonstrate why we do each step and prove that data engineering will positively impact prediction accuracy.
To avoid one-off code, we will encapsulate all cleaning and engineering steps in functions so that we can reuse them if new data becomes available.
def clean_data(data):
    # Encode the target: 'yes' -> 1, 'no' -> 0
    data["outcome"] = data["outcome"].replace({"yes": 1, "no": 0})
    return data
df_bank_raw = clean_data(df_bank_raw)
df_bank = clean_data(df_bank)
# Replace 'unknown' labels with null values in the categorical feature columns
cate_cols = categorical_columns[:-1]  # exclude the target column 'outcome'

def clear_unknown(df):
    for c in cate_cols:
        df[c] = df[c].replace({"unknown": np.nan})
    return df
df_bank = clear_unknown(df_bank)
When assessing the missingness of the data, we found that about a quarter of the records contain at least 5% null values.
Because the number of incomplete rows is significant, we cannot drop these rows outright without understanding the pattern of missingness.
# How much data is missing in each row of the dataframe?
nulls_in_row = df_bank.isnull().sum(axis=1)/len(df_bank.columns)
n, bins, patches = plt.hist(nulls_in_row, 10, facecolor='red', alpha=0.4)
plt.show();
We noticed that most of the null values are from the 'default', 'education', 'housing' and 'loan' columns.
The missing values are concentrated in particular columns such as 'education'. Therefore, we can use the most common category to replace the null values and consider removing the incomplete records after analysing the importance of the attributes.
# missing pattern in columns
pd.DataFrame(df_bank.isnull().sum(axis=0)/len(df_bank), columns = ['pct_of_missing']).sort_values(by = 'pct_of_missing', ascending = False).style.background_gradient()
# missing pattern in rows.
plt.subplots(figsize=(10,10))
sns.heatmap(df_bank.isnull(), cbar=False);
We adopt a combination of frequent-category imputation and a new indicator column denoting which values were imputed.
def mark_nan(df, col):
    # Add an indicator column: 1 where the original value is missing, else 0
    # (direct assignment avoids the chained-indexing SettingWithCopy pitfall)
    df[col + "_imputed"] = df[col].isnull().astype(int)
    return df

df_bank = mark_nan(df_bank, "education")
def impute_value(df):
    # Fill missing values in each categorical column with its most frequent label
    for c in cate_cols:
        if df[c].isnull().sum() > 0:
            most_frequent = df[c].mode()[0]
            df[c].fillna(most_frequent, inplace=True)
    return df
df_bank = impute_value(df_bank)
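For reference, scikit-learn bundles the same imputation-plus-indicator pattern into one step via `SimpleImputer(strategy='most_frequent', add_indicator=True)`; a minimal sketch on toy data (not the bank frame):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({"education": ["basic.4y", np.nan, "basic.4y", "high.school"]})
imp = SimpleImputer(strategy="most_frequent", add_indicator=True)
out = imp.fit_transform(toy)
print(out)
# column 0: imputed values; column 1: truthy where the original was missing
```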
def split_XY(df):
    X = df.loc[:, df.columns != 'outcome']
    Y = df['outcome']
    return X, Y

def split_data(X, Y):
    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
    return x_train, x_test, y_train, y_test
#engineered data
x_eng, y_eng = split_XY(df_bank)
# one hot encoding
x_eng = pd.get_dummies(x_eng)
x_train_eng, x_test_eng, y_train_eng, y_test_eng = split_data(x_eng, y_eng)
This dataset contains 20 columns, and 11 of them are categorical. Therefore, one potential problem is the long execution time and over-fitting brought by the high dimension of the one-hot encoded data.
We will use the random forest algorithm's feature importance to remove columns that are not contributing a lot to the prediction. For example, from the below feature importance ranking, we can see the importance for 'default_no', 'default_yes', and 'education_illiterate' is close to zero; hence, we can drop these variables.
To ensure the feature selection is adequate, we will compare the accuracy score obtained by predicting the original and selected data using cross-validation below.
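As an aside, the manual column drop performed below can be automated with scikit-learn's `SelectFromModel`, which keeps only features whose importance clears a threshold; a sketch on synthetic data (not the bank dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=42)
# Keep features whose forest importance is at or above the median importance
selector = SelectFromModel(RandomForestClassifier(random_state=42),
                           threshold="median")
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)
```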
#predict on original data using random forest model
rfc = RandomForestClassifier(random_state = 42)
rfc = rfc.fit(x_train_eng,y_train_eng)
cv_results = cross_validate(rfc, x_train_eng, y_train_eng, cv=3, scoring = ("accuracy","precision",'f1','roc_auc'))
cv_results
avg_accuracy = cv_results['test_accuracy'].mean()*100
print(f'The original RFC model has an average accuracy of {avg_accuracy:.2f}%')
feat_importances = pd.Series(rfc.feature_importances_, index=x_train_eng.columns).sort_values(ascending=True)
feat_importances.plot(kind='barh',figsize=(15,12));
plt.show();
# remove the low importance attributes
x_train_eng = x_train_eng.drop(columns = ['default_no', 'default_yes','education_illiterate'])
x_test_eng = x_test_eng.drop(columns = ['default_no', 'default_yes','education_illiterate'])
rfc = RandomForestClassifier(random_state = 42)
rfc = rfc.fit(x_train_eng,y_train_eng)
cv_results = cross_validate(rfc, x_train_eng, y_train_eng, cv=3, scoring = ("accuracy","precision",'f1','roc_auc'))
cv_results
avg_accuracy = cv_results['test_accuracy'].mean()*100
print(f'The new RFC model has an average accuracy of {avg_accuracy:.2f}%')
From the counts below, we can see that unsuccessful outcomes outnumber successful ones roughly eight to one. The imbalance in the data may bias the model.
Hence, we will re-sample the data in two ways and compare the results:
#Discover imbalance in the dataset
outcome_1 = len(df_bank[df_bank['outcome'] == 1])
outcome_0 = len(df_bank[df_bank['outcome'] == 0])
print(f'{outcome_1} of the records have positive results, accounting for {outcome_1/len(df_bank)*100:.2f}% of the data')
print(f'{outcome_0} of the records have negative results, accounting for {outcome_0/len(df_bank)*100:.2f}% of the data')
smote = SMOTE(sampling_strategy='not majority',random_state=42)
adasyn = ADASYN(random_state=42)
def test_sample_method(x_train_eng, y_train_eng, method):
    if method != "":
        # fit_resample replaces the deprecated fit_sample in recent imblearn versions
        x_train_eng, y_train_eng = method.fit_resample(x_train_eng, y_train_eng)
        print(f'{method}: \n \n{y_train_eng.sum()} positive outcome & {len(x_train_eng)-y_train_eng.sum()} negative outcome')
    clf = RandomForestClassifier(random_state=42)
    clf = clf.fit(x_train_eng, y_train_eng)
    cv_results = cross_validate(clf, x_train_eng, y_train_eng, cv=3, scoring=("accuracy", "precision", 'f1', 'roc_auc'))
    avg_accuracy = cv_results['test_accuracy'].mean()*100
    avg_precision = cv_results['test_precision'].mean()
    avg_roc_auc = cv_results['test_roc_auc'].mean()
    print(f'The RFC model has an average accuracy of {avg_accuracy:.2f}%')
    print(f'The RFC model has an average precision of {avg_precision:.2f}')
    print(f'The RFC model has an average roc_auc of {avg_roc_auc:.2f}')
    return
#test original data performance
test_sample_method(x_train_eng, y_train_eng,"")
#test smote performance
test_sample_method(x_train_eng, y_train_eng, smote)
#test adasyn performance
test_sample_method(x_train_eng, y_train_eng, adasyn)
The prediction on data without any sampling achieved an average accuracy of 89.28%, about 3% lower than on the re-sampled data, indicating that re-sampling does improve the performance of our model.
# copy the raw data and use the cleaning
df_bank = df_bank_raw.copy()
df_bank = clean_data(df_bank)
x_eng, y_eng = split_XY(df_bank)
# combine all the transformation above to one function
def engineer_data(df):
    # Clear 'unknown' labels
    df = clear_unknown(df)
    # Mark education null values
    df = mark_nan(df, "education")
    # Impute missing values
    df = impute_value(df)
    # One-hot encoding
    df = pd.get_dummies(df)
    # Drop the low-importance attributes found earlier
    df = df.drop(columns=['default_no', 'default_yes', 'education_illiterate'])
    return df
#apply all the data engineering function and split train and test data
x_eng = engineer_data(x_eng)
x_train, x_test, y_train, y_test = split_data(x_eng, y_eng)
x_train, y_train = adasyn.fit_resample(x_train, y_train)
#combine xy data for k fold validation
df_sampled_train = pd.concat([x_train, y_train], axis = 1)
df_sampled_test = pd.concat([x_test, y_test], axis = 1)
#save processed data to csv for later use
df_sampled_train.to_csv("train_data.csv",index = False)
df_sampled_test.to_csv("test_data.csv",index = False)
#raw data without any engineering
x_raw, y_raw = split_XY(df_bank_raw)
# one hot encoding
x_raw = pd.get_dummies(x_raw)
#train test split
x_train_raw, x_test_raw, y_train_raw, y_test_raw = split_data(x_raw, y_raw)
df_base_performance = pd.DataFrame(['Accuracy', 'ROC_AUC', 'precision', 'recall', 'fbeta_score'], columns = ['metrics'])
#test accuracy against the original data
#clf1 uses original data
clf1 = GaussianNB()
clf1 = clf1.fit(x_train_raw,y_train_raw)
y_test_raw_pred = clf1.predict(x_test_raw)
accuracy = clf1.score(x_test_raw,y_test_raw)
roc = roc_auc_score(y_test_raw,y_test_raw_pred)
prf_scores = precision_recall_fscore_support(y_test_raw, y_test_raw_pred, average='binary', beta = 0.5)
print(f'The Naive Bayes classifier has a test accuracy of {accuracy*100:.2f}% on the raw dataset')
df_base_performance['scores_raw_data'] = [round(x,3) for x in [accuracy] + [roc] + list(prf_scores)[:3]]
df_base_performance.style.background_gradient(axis = 1)
cm = confusion_matrix(y_test_raw, y_test_raw_pred, labels=clf1.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,display_labels=clf1.classes_)
disp.plot()
plt.show();
Summary of the dataset characteristic:
Because of the above properties, we can expect supervised classification models such as boosting and tree-based algorithms to work well on this dataset.
df_sampled_train = pd.read_csv("train_data.csv")
df_sampled_test = pd.read_csv("test_data.csv")
x_train_eng = df_sampled_train.loc[:, df_sampled_train.columns != 'outcome']
y_train_eng = df_sampled_train['outcome']
x_test_eng = df_sampled_test.loc[:, df_sampled_test.columns != 'outcome']
y_test_eng = df_sampled_test['outcome']
# list of classifiers
clf_list = [
#averaging
DecisionTreeClassifier(random_state = 42),
RandomForestClassifier(random_state = 42),
BaggingClassifier(random_state = 42),
#boosting
AdaBoostClassifier(random_state = 42),
GradientBoostingClassifier(random_state = 42),
XGBClassifier(random_state = 42, use_label_encoder=False),
HistGradientBoostingClassifier(random_state = 42),
#others
KNeighborsClassifier(),
GaussianNB()]
# split the dataset into 5 folds for multiple rounds of training and validation
kf = KFold(n_splits=5,random_state=42,shuffle=True)
#create the lists of performance of each classifiers
mdl = []
fold = []
accuracy = []
fbeta = []
runtime = []
roc = []
for i, (train_index, test_index) in enumerate(kf.split(df_sampled_train)):
    train = df_sampled_train.iloc[train_index, :]
    test = df_sampled_train.iloc[test_index, :]
    print(f'FOLD {i+1} started')
    for clf in clf_list:
        model = clf.__class__.__name__
        train_input = train.loc[:, train.columns != 'outcome']
        train_output = train['outcome']
        valid_input = test.loc[:, test.columns != 'outcome']
        valid_output = test['outcome']
        start = time()
        clf = clf.fit(train_input, train_output)
        pred = clf.predict(valid_input)
        end = time()
        accuracyScore = accuracy_score(y_true=valid_output, y_pred=pred)
        fbetaScore = fbeta_score(y_true=valid_output, y_pred=pred, beta=0.5, average='binary')
        rocaucScore = roc_auc_score(valid_output, pred)
        fold.append(i+1)
        accuracy.append(accuracyScore)
        fbeta.append(fbetaScore)
        roc.append(rocaucScore)
        runtime.append(end-start)
        mdl.append(model)
print("ALL training done")
performance = pd.DataFrame({'Model': mdl, 'Accuracy':accuracy, 'Fbeta_0.5': fbeta, 'ROC_AUC':roc, 'Runtime': runtime, 'Fold':fold})
metric_list = ["Accuracy", "ROC_AUC", "Fbeta_0.5", "Runtime"]
for m in metric_list:
    plt.subplots(figsize=(13, 8))
    sns.lineplot(x="Fold", y=m, hue="Model", data=performance).set_title(m)
    plt.legend(loc='lower right', prop={'size': 12})
    plt.show();
# visualise performance in each metric
performance.groupby(by = ['Model']).mean().iloc[:,:-1].sort_values(by = 'Accuracy', ascending = False).style.background_gradient()
After sorting by accuracy, we found the 3 best models are:
# list of selected classifiers
new_clf_list = [
HistGradientBoostingClassifier(random_state = 42),
RandomForestClassifier(random_state = 42),
XGBClassifier(random_state = 42, use_label_encoder=False)]
df_untunned_performance = pd.DataFrame(['Accuracy', 'ROC_AUC', 'precision', 'recall', 'fbeta_score'], columns = ['metrics'])
for clf in new_clf_list:
    model = clf.__class__.__name__
    clf = clf.fit(x_train_eng, y_train_eng)
    y_pred = clf.predict(x_test_eng)
    accuracy = clf.score(x_test_eng, y_test_eng)
    roc = roc_auc_score(y_test_eng, y_pred)
    prf_scores = precision_recall_fscore_support(y_test_eng, y_pred, average='binary', beta=0.5)
    df_untunned_performance[model] = [round(x, 4) for x in [accuracy] + [roc] + list(prf_scores)[:3]]
    print(f'{model} finished')
df_untunned_performance.set_index('metrics').T.sort_values(by = 'Accuracy', ascending = False).style.background_gradient()
param_grid = {'n_estimators': np.linspace(100,1000,5).astype(int),
'criterion': ['entropy', 'gini']}
clf = GridSearchCV(RandomForestClassifier(random_state = 42), param_grid, cv = 3, scoring= 'accuracy')
clf = clf.fit(x_train_eng, y_train_eng)
print(clf.best_params_)
print(np.abs(clf.best_score_))
#setting grid of selected parameters for iteration
param_grid = {
'eta': [ 0.05, 0.1, 0.15, 0.2],
'gamma':[0, 0.2, 0.4, 0.6, 0.8, 1],
'n_estimators': np.linspace(100,1000,5).astype(int)}
clf = GridSearchCV(XGBClassifier(random_state = 42, use_label_encoder=False), param_grid, cv = 3, scoring= 'accuracy')
clf = clf.fit(x_train_eng, y_train_eng)
print(clf.best_params_)
print(np.abs(clf.best_score_))
param_grid = {'max_depth':[4,6,8],
'max_leaf_nodes':np.arange(3,23,2),
'learning_rate':[0.05, 0.1, 0.15, 0.2]}
clf = GridSearchCV(HistGradientBoostingClassifier(random_state = 42), param_grid, cv = 3, scoring= 'accuracy')
clf = clf.fit(x_train_eng, y_train_eng)
print(clf.best_params_)
print(np.abs(clf.best_score_))
# list of selected tuned classifiers
tuned_clf_list = [
HistGradientBoostingClassifier(learning_rate = 0.21, max_depth = 25, max_leaf_nodes = 34, random_state = 42),
RandomForestClassifier(criterion = "entropy", n_estimators = 1000, random_state = 42),
XGBClassifier(eta = 0.01, gamma = 0, n_estimators = 1000, random_state = 42, use_label_encoder=False)]
df_tunned_performance = pd.DataFrame(['Accuracy', 'ROC_AUC', 'precision', 'recall', 'fbeta_score'], columns = ['metrics'])
for clf in tuned_clf_list:
    model = clf.__class__.__name__
    clf = clf.fit(x_train_eng, y_train_eng)
    y_pred = clf.predict(x_test_eng)
    accuracy = clf.score(x_test_eng, y_test_eng)
    roc = roc_auc_score(y_test_eng, y_pred)
    prf_scores = precision_recall_fscore_support(y_test_eng, y_pred, average='binary', beta=0.5)
    df_tunned_performance[model] = [round(x, 4) for x in [accuracy] + [roc] + list(prf_scores)[:3]]
    print(f'{model} finished')
df_tunned_performance.set_index('metrics').T.sort_values(by = 'Accuracy', ascending = False).style.background_gradient()
We will use the voting method to combine classifiers, balancing out the weaknesses of each individual model to increase overall accuracy.
Because we already have untuned and tuned models at hand, we can compare the performance of the majority-vote (hard voting) method and the averaged-predicted-probabilities (soft voting) method.
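The two schemes can disagree on the same sample; this small numpy sketch computes both votes from three hypothetical classifiers' class probabilities:

```python
import numpy as np

# P(class 0), P(class 1) from three hypothetical classifiers for one sample
probas = np.array([[0.40, 0.60],
                   [0.45, 0.55],
                   [0.90, 0.10]])

hard_labels = probas.argmax(axis=1)            # per-classifier labels: [1, 1, 0]
hard_vote = np.bincount(hard_labels).argmax()  # majority label -> 1

# Soft voting averages the probabilities first: [0.583, 0.417] -> class 0
soft_vote = probas.mean(axis=0).argmax()

print(hard_vote, soft_vote)  # 1 0
```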
df_vote_performance = pd.DataFrame(['Accuracy', 'ROC_AUC', 'precision', 'recall', 'fbeta_score'], columns = ['metrics'])
# list of selected classifiers
new_clf_list = [
HistGradientBoostingClassifier(random_state = 42),
RandomForestClassifier(random_state = 42),
XGBClassifier(random_state = 42, use_label_encoder=False)]
vclf1 = VotingClassifier(estimators=[('hgb', new_clf_list[0]), ('rf', new_clf_list[1]), ('xgb', new_clf_list[2])], voting='hard')
vclf1 = vclf1.fit(x_train_eng, y_train_eng)
y_pred = vclf1.predict(x_test_eng)
accuracy = vclf1.score(x_test_eng,y_test_eng)
roc = roc_auc_score(y_test_eng,y_pred)
prf_scores = precision_recall_fscore_support(y_test_eng,y_pred, average='binary', beta = 0.5)
df_vote_performance['hard_vote'] = [round(x,4) for x in [accuracy] + [roc] + list(prf_scores)[:3]]
df_vote_performance.style.background_gradient(axis = 1)
tuned_clf_list = [
HistGradientBoostingClassifier(learning_rate = 0.21, max_depth = 25, max_leaf_nodes = 34, random_state = 42),
RandomForestClassifier(criterion = "entropy", n_estimators = 1000, random_state = 42),
XGBClassifier(eta = 0.01, gamma = 0, n_estimators = 1000, random_state = 42, use_label_encoder=False)]
vclf2 = VotingClassifier(estimators=[('hgb', tuned_clf_list[0]), ('rf', tuned_clf_list[1]), ('xgb', tuned_clf_list[2])], voting='soft')
vclf2 = vclf2.fit(x_train_eng, y_train_eng)
y_pred = vclf2.predict(x_test_eng)
accuracy = vclf2.score(x_test_eng,y_test_eng)
roc = roc_auc_score(y_test_eng,y_pred)
prf_scores = precision_recall_fscore_support(y_test_eng,y_pred, average='binary', beta = 0.5)
df_vote_performance['soft_vote'] = [round(x,4) for x in [accuracy] + [roc] + list(prf_scores)[:3]]
df_vote_performance.style.background_gradient(axis = 1)
pd.concat([df_base_performance,df_vote_performance.iloc[:,1:]], axis = 1).style.background_gradient(axis = 1)
Comparing the base model, the hard-voting model and the soft-voting model, we found that soft voting achieves the highest accuracy while hard voting achieves the highest precision score.
Both voting models out-performed the base model, improving accuracy by approximately 5% and the precision score by around 0.2, so the project goal of improving accuracy and precision is achieved.
The voting classifiers also perform slightly better than the tuned single models.